Lab: Exploring Data

In this lab we explore a data set to understand its characteristics before we begin modeling.

Resources

Chapters 3 and 4 of the book cover a portion of this material and have additional information on autocorrelation. The R For Spatial Statistics web site may also be of help.

Activities

Select a data set that is a point set within the United States (it's really challenging to find covariate layers outside the US) that includes a measured value for a response variable (presence/absence, height, DBH, weight, etc.). The data set will typically be point locations for a species but also needs to include a measurement.

If you have a data set that is not a set of points with continuous values, you can use one of the methods on the web page Converting Layers to change your data into a set of points. If you do not have a data set, you may use one of the ones below:

Collect a set of covariate layers for your data set. Set up this data set as a shapefile or CSV file and extract the values from the covariate layers into your point data set using ArcGIS, BlueSpray or R. The code to extract raster values is at the bottom of the R website raster page. Covariate layers such as precipitation and temperature are available from BioClim and PRISM on the Web. There are also potential candidates for covariates on the Schedule web page for this class.
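The extraction step in R can be sketched as follows. This is a minimal example and not the code from the R website: the raster and the points are synthetic stand-ins, and the names Covariate, ThePoints, X and Y are made up for illustration. With real data you would read the layer with raster(...) and the points with read.csv(...).

```r
library(raster)

# Hypothetical covariate layer: a small synthetic raster standing in for a
# real layer such as annual mean temperature
Covariate <- raster(nrows = 10, ncols = 10, xmn = 0, xmx = 10, ymn = 0, ymx = 10)
values(Covariate) <- 1:ncell(Covariate)

# Hypothetical point data with coordinates and a measured response
ThePoints <- data.frame(
  X = c(0.5, 5.5, 9.5),
  Y = c(9.5, 4.5, 0.5),
  MaxHt = c(12.1, 20.4, 15.3)
)

# Extract the raster value under each point and store it as a new column
ThePoints$Covariate <- raster::extract(Covariate, ThePoints[, c("X", "Y")])

print(ThePoints)
```

The points and the raster must be in the same coordinate system for the extracted values to be meaningful. The result can then be saved with write.csv(...).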

 

1. Spatial Issues

Map your data set and covariates and perform a visual inspection of the data at various resolutions. Use symbols to map your response variable values at your points and your covariate values (i.e. attribute values). For rasters, change the min/max range of values displayed in your GIS to find the precision of the values, and hillshade the rasters to find artifacts.

Question 1: Did you find anything interesting in your data at this point? How will these issues affect your modeling?

 

2. Create histogram(s) of the response variable(s)

To read the data into R, you can use the "read.csv(...)" function. It is a good idea to run "na.omit(...)" to remove any rows that have blank entries. You may get an error on cells that have non-numeric entries; use MS-Excel to remove these rows. Make a note of the change in the number of rows.

# Read the data
TheData = read.csv("C:\\Users\\jim\\Desktop\\GSp 570\\Lab3 Exploring Data\\SweetGum1_NoNones.csv") 

# Remove any NA (null) values
TheData=na.omit(TheData)

# Make a histogram of one of the response (dependent) variables
hist(TheData$MaxHt,breaks=150)
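As an alternative to editing the file by hand, non-numeric entries can also be handled inside R. The sketch below uses a small made-up data frame in place of the CSV file; the "None" entry and the column values are assumptions for illustration. It converts the response column to numeric, drops the incomplete rows, and reports the row counts before and after.

```r
# Hypothetical data standing in for the CSV file; "None" mimics a
# non-numeric entry and NA mimics a blank cell
TheData <- data.frame(
  MaxHt = c("12.5", "None", "30.1", NA, "18.0"),
  AnnMeanTmp = c(15.2, 16.1, NA, 14.8, 15.9),
  stringsAsFactors = FALSE
)
RowsBefore <- nrow(TheData)

# Convert the column to numeric; entries that are not numbers become NA
TheData$MaxHt <- suppressWarnings(as.numeric(TheData$MaxHt))

# Remove every row that contains an NA in any column
TheData <- na.omit(TheData)
RowsAfter <- nrow(TheData)

# Report the change in the number of rows
cat("Rows before:", RowsBefore, " Rows after:", RowsAfter, "\n")

hist(TheData$MaxHt, breaks = 10)
```

Note that na.omit(...) drops a row if any column is NA, so it can help to keep only the columns you need before calling it.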

Question 2: Do the histograms appear as you expect? Are there any artifacts that might impact your models? Did you find any other issues?

 

3. Create histograms of your covariate values using the values in the CSV file and the original raster values.

Below is the code to read a raster and create a histogram:

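# Load the raster package
library(raster)
# rgdal supplies the GDAL drivers that raster uses to read GeoTIFF files;
# rgdal has since been retired from CRAN and recent versions of raster do not need it
library(rgdal)
# Read the raster and create a histogram of its cell values
AnnualMeanTemp=raster("C:\\Users\\jim\\Desktop\\GSp 570\\Lab3 Exploring Data\\BioClim_NorthAmerica_4km\\bio_1_AnnualMeanTemp_cropped.tif")
hist(AnnualMeanTemp,breaks=100)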
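One way to compare the two sets of values is to plot them side by side with identical breaks. The sketch below is only an illustration: the raster is synthetic (a west-to-east gradient plus noise) and the sample points are deliberately clustered on one side, so all names and numbers are assumptions and not part of the lab data.

```r
library(raster)

set.seed(1)
# Hypothetical raster standing in for a covariate layer: values increase
# from west to east, with some random noise added
Covariate <- raster(nrows = 50, ncols = 50, xmn = 0, xmx = 50, ymn = 0, ymx = 50)
values(Covariate) <- rep(1:50, times = 50) + rnorm(ncell(Covariate))

# Hypothetical sample locations clustered in the western part of the raster
PointXY <- cbind(runif(200, 0, 20), runif(200, 0, 20))
PointValues <- raster::extract(Covariate, PointXY)

# Use the same breaks for both histograms so they can be compared directly
AllValues <- values(Covariate)
TheBreaks <- seq(min(AllValues), max(AllValues), length.out = 31)

par(mfrow = c(1, 2))
hist(AllValues, breaks = TheBreaks, main = "Raster values")
hist(PointValues, breaks = TheBreaks, main = "Values at points")
par(mfrow = c(1, 1))
```

In this made-up case the points cover only the low end of the raster's range, which is the kind of sampling difference the two histograms can reveal.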

Question 3: Do you see any artifacts in the histograms of the original rasters? Did anything interesting change when you plotted histograms of the covariate values in the CSV file?

4. Create histograms (categorical response) and/or scatterplots (continuous response) of your covariate values against your response values.

You can create a scatterplot by just plotting your response against your covariate values:

plot(TheData$AnnMeanTmp,TheData$MaxHt)
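A slightly fuller sketch covers both cases. The data below are simulated, and the Present column is a hypothetical presence/absence response. For the continuous response, a lowess smoother is drawn over the scatterplot to hint at the shape of the curve. For the categorical response, the covariate is plotted as one histogram per class.

```r
set.seed(42)
# Hypothetical data: a continuous response (MaxHt), a presence/absence
# response (Present), and one covariate (AnnMeanTmp)
AnnMeanTmp <- runif(200, 5, 25)
MaxHt <- 40 - 0.15 * (AnnMeanTmp - 17)^2 + rnorm(200, sd = 3)
Present <- rbinom(200, 1, plogis(0.4 * (AnnMeanTmp - 15)))
TheData <- data.frame(AnnMeanTmp, MaxHt, Present)

par(mfrow = c(1, 3))

# Continuous response: scatterplot with a lowess smoother
plot(TheData$AnnMeanTmp, TheData$MaxHt, xlab = "AnnMeanTmp", ylab = "MaxHt")
TheSmooth <- lowess(TheData$AnnMeanTmp, TheData$MaxHt)
lines(TheSmooth, col = "red")

# Categorical response: one covariate histogram per class, on shared breaks
TheBreaks <- seq(5, 25, by = 1)
hist(TheData$AnnMeanTmp[TheData$Present == 1], breaks = TheBreaks,
     main = "Present", xlab = "AnnMeanTmp")
hist(TheData$AnnMeanTmp[TheData$Present == 0], breaks = TheBreaks,
     main = "Absent", xlab = "AnnMeanTmp")

par(mfrow = c(1, 1))
```

The smoother is only a visual aid for choosing a candidate curve. It is not a fitted model.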

Question 4: Do the covariates appear to "co-vary" with anything in the response variable? What type of "curve" might these be modeled with?

5. Use the R function cor(vector1,vector2), or another method, to determine whether each pair of predictor variables is correlated. Also create a correlation plot of all of the covariates.

See the R web site correlation page for more information.
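As a rough sketch using base R only, the pairwise correlations and a simple correlation plot could be produced as shown below. The covariates are simulated, and Precip is built to depend on AnnMeanTmp, so the column names and values are assumptions for illustration.

```r
set.seed(7)
# Hypothetical covariates; Precip is constructed to correlate with AnnMeanTmp
AnnMeanTmp <- rnorm(100, mean = 15, sd = 4)
Precip <- 1200 - 30 * AnnMeanTmp + rnorm(100, sd = 40)
Elevation <- runif(100, 0, 2000)
Covariates <- data.frame(AnnMeanTmp, Precip, Elevation)

# Correlation between one pair of covariates
OnePair <- cor(Covariates$AnnMeanTmp, Covariates$Precip)
print(OnePair)

# Correlation matrix for all pairs of covariates
CorMatrix <- cor(Covariates)
print(round(CorMatrix, 2))

# Scatterplot matrix of all covariates
pairs(Covariates)
```

cor(...) computes the Pearson correlation by default. Passing method = "spearman" gives a rank-based alternative that is less sensitive to outliers and non-linear but monotonic relationships.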

Question 5: Are there any strong correlations between the covariates?

 

© Copyright 2018 HSU - All rights reserved.